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Interactions among multiple genes across the genome may con- 
tribute to the risks of many complex human diseases. Whole-genome 
single nucleotide polymorphisms (SNPs) data collected for many thou- 
sands of SNP markers from thousands of individuals under the case- 
control design promise to shed light on our understanding of such 
interactions. However, nearby SNPs are highly correlated due to link- 
age disequilibrium (LD) and the number of possible interactions is 
too large for exhaustive evaluation. We propose a novel Bayesian 
method for simultaneously partitioning SNPs into LD-blocks and se- 
lecting SNPs within blocks that are associated with the disease, ei- 
ther individually or interactively with other SNPs. When applied to 
homogeneous population data, the method gives posterior probabili- 
ties for LD-block boundaries, which not only result in accurate block 
partitions of SNPs, but also provide measures of partition uncer- 
tainty. When applied to case-control data for association mapping, 
the method implicitly filters out SNP associations created merely by 
LD with disease loci within the same blocks. Simulation study showed 
that this approach is more powerful in detecting multi-locus associ- 
ations than other methods we tested, including one of ours. When 
applied to the WTCCC type 1 diabetes data, the method identified 
many previously known TID associated genes, including PTPN22, 
CTLA4, MHC, and IL2RA. The method also revealed some interest- 
ing two-way associations that are undetected by single SNP methods. 
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Most of the significant associations are located witfiin the MHG re- 
gion. Our analysis showed that the MHG SNPs form long-distance 
joint associations over several known recombination hotspots. By con- 
trolling the haplotypes of the MHG class H region, we identified addi- 
tional associations in both MHG class I {HLA-A, HLA-B) and class 
HI regions (BATl). We also observed significant interactions between 
genes PRSS16, ZNF184 in the extended MHG region and the MHG 
class II genes. The proposed method can be broadly applied to the 
classification problem with correlated discrete covariates. 

1. Introduction. A recent genome-wide association (GWA) study of 
14,000 cases of seven human genetic diseases and 3,000 shared controls by 
the Welcome Trust Case Control Consortium [WTCCC (2007)] represents 
a thorough validation of the GWA approach. By testing hundreds of thou- 
sands of single nucleotide polymorphisms (SNPs) in the human population 
(Affymetrix 500k SNP), the study has identified many SNPs associated with 
seven complex diseases [WTCCC (2007)]. Each SNP consists of two alle- 
les taking values or 1, and there are three possible combinations of the 
two alleles: (0,0), (0,1), (1,1), disregarding the order. Each combination is 
called a genotype, representing wild homozygote, heterozygote, and mutant 
homozygote, respectively. In a case-control study, a SNP is said to be asso- 
ciated with the disease if the genotype (or allele) distribution at the SNP is 
different between cases and controls. In addition to testing individual SNPs, 
it has also been anticipated that epistatic interactions among SNPs, defined 
as multiple SNPs jointly associated with the disease, may be responsible 
for significantly elevating the risks of some human complex diseases. Due to 
computational and methodological limitations, however, efforts on detect- 
ing disease-related epistatic interactions among SNPs in the WTCCC study 
have been limited. 

In the past few years, many approaches have been developed for case- 
control studies to detect epistasis associations. Most methods cannot be 
applied to GWA studies due to their computational limitations except for 
some recently developed methods, such as the stepwise logistic regression 
method [Marchini, Donnelly and Cardon (2005)] and the Bayesian epistasis 
association mapping (BEAM) algorithm [Zhang and Liu (2007)]. It has been 
demonstrated that BEAM is capable of detecting high-order interactions in 
GWA studies and is more powerful than other existing methods [Zhang and 
Liu (2007)]. A limitation of BEAM, however, is its model assumption that 
a Markov chain can capture the dependence structure of the SNPs in the 
data. It is well known that linkage disequilibrium (LD) between adjacent 
SNPs exhibits block-wise structure in the human genome [The International 
HapMap Consortium (2005), Reich et al. (2001)]. SNPs within blocks are 
highly correlated and the correlation is broken down by recombination events 
at block boundaries. A simple Markov model cannot capture this important 
block structure when analyzing dense SNPs. 
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Previous studies have shown that using haplotypes, defined as allele combi- 
nations over multiple nearby SNPs inherited from one of the parents, cannot 
only reduce the high computation cost in GWA studies, but also improve 
the detection power of association mapping [Kuno et al. (2004); Zollner and 
Pritchard (2005); Johnson et al. (2001); Zhang et al. (2002a); de Bakker et 
al. (2005)]. In particular, Nielsen et al. (2004) showed that when moder- 
ate to high levels of LD exist, haplotype tests tend to be substantially more 
powerful. Kuno et al. (2004) demonstrated in a real disease study that single- 
SNP tests were not significant even at SNP loci close to the mutation site 
(APRT*J), whereas the haplotype block data yielded sufficient statistical 
significance. 

Similar to haplotypes, we define diplotypes as genotype combinations over 
multiple nearby SNPs. Diplotypes are directly observed in GWA studies, 
whereas haplotypes have to be inferred using computationally expensive 
algorithms. Throughout the paper, we discuss our method and results on 
diplotype associations with the disease, although the method is directly ap- 
plicable to haplotype data. Our focus on diplotype association is mainly due 
to the computational concern, where inferring unobserved haplotypes will be 
extremely time consuming. It is also possible that testing diplotype associa- 
tions could be more powerful than testing haplotype associations, depending 
on the underlying disease model. On the other hand, if haplotype associa- 
tions are of the interests, users can first infer haplotypes using an available 
haplotype inference algorithm, and then input the inferred haplotypes into 
our method. The degrees of freedom of our model will automatically accom- 
modate the different inputs. 

In this paper we extend the BEAM model to address the block struc- 
tures in the human genome. We refer to SNP block structures as LD-blocks. 
By partitioning SNPs into LD-blocks, a naive extension of BEAM is to 
treat each LD-block as a genetic marker, with diplotypes in the block being 
treated as alleles. This approach, however, may not be optimal for asso- 
ciation mapping. First, criteria utilized in existing block-partitioning algo- 
rithms do not directly aim at optimizing the power of association mapping. 
Second, many regions in the human genome demonstrate vague structural 
patterns, of which a measure of uncertainty in block structures should be 
provided. Simulation studies have shown that LD-block structures can be 
affected by marker density [Wang et al. (2002); Pilhps et al. (2003); WaU 
and Pritchard (2003)], population structure [Wang et al. (2002); Stumpf 
and Goldstein (2003); Zhang et al. (2003); Anderson and Slatkin (2004)], 
and gene conversion [Przeworski and Wall (2001)]. 

We propose a Bayesian model to simultaneously infer LD-blocks and select 
SNPs within blocks for disease association mapping. The model partitions 
the genome into discrete blocks, within which the diversity of diplotypes is 
limited. Block structures are iteratively updated such that disease associa- 
tions are detected and summarized from a variety of likely block partitions. 
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This approach takes into account the uncertainty in block structures and 
optimizes the detection power by searching for the best partitions around 
disease-associated SNPs. Our method detects combinations of SNPs within 
and between blocks for marginal and epistatic associations to the disease 
status. Using LD-blocks, our method also automatically filters out artifi- 
cial associations created merely by LD with nearby authentically associated 
SNPs. We show that the new method, BEAM2, is more powerful than the 
original BEAM (renamed BEAMl henceforth). 

By applying BEAM2 to the type 1 diabetes (TID) data from WTCCC 
(2007), we obtained all the previously identified single SNP associations in 
the WTCCC TID data. We further observed some interesting two-way joint 
associations not detectable by single-SNP methods. The strongest TID asso- 
ciations occur in the well-known Major Histocompatibility Complex (MHC) 
region, in which we observed long-distance joint association patterns over 
several millions of base pairs (Mb). Since this pattern may be partially 
caused by the extended LD from the MHC class H region, we further con- 
trolled the structure of MHC class H using a logistic regression model, and 
tested additional effects of SNPs over the extended MHC region as well as 
the MHC class I and class HI regions. We observed strong associations in 
the MHC class I and class HI regions, and found significant interaction asso- 
ciations between genes PRSS16, ZNF184 in the extended MHC region and 
the MHC class II genes. 

The article is organized as follows. In Section 2 we first introduce a LD- 
block model for the genotype distribution among multiple SNPs. We model 
the genotype distribution within a block of SNPs by multinomial distribu- 
tions and assume that the joint distribution of all SNPs is the product of 
the distributions of individual blocks (i.e., assuming block independence). 
In Section 3 we extend the LD-block model to incorporate disease associ- 
ated SNPs and epistasis and describe Monte Carlo Markov chain (MCMC) 
algorithms to make inference from the joint model for both LD-blocks and 
disease associations. In Section 4 we briefly review our previously developed 
Bayes factor-based test statistic, which is used to further evaluate the sta- 
tistical significance of candidate epistasis detected by BEAM2, and discuss 
extensions of BEAM and BEAM2 for general classification problems. In Sec- 
tion 5 we demonstrate the superior performance of BEAM2 by simulation 
studies and real data applications. In Section 6 we report our results from 
applying BEAM2 to the T2D data from WTCCC (2007). We conclude the 
article with a short discussion in Section 7. More implementation details can 
be found in Supplemental Material [Zhang, Zhang and Liu (2011)]. 

2. A Bayesian model for LD-block inference. The data of interest consist 
of genotypes at a total of L SNP markers (or L covariates, each taking on 
3 possible values) observed in A'^^ case and control individuals. Let D = 
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{Di , . . . ,Dl) denote a N^x L matrix of case genotypes, and U = {Ui , . . . ,Ul) 
denote a x L matrix of control genotypes, where Di and Ui denote vectors 
of genotypes observed at SNP i across individuals. 

2.1. Bayesian LD-block model. Here we introduce a Bayesian LD-block 
partition model without considering disease association. Hence, we treat 
cases (D) and controls (U) as coming from the same population and use 
the combined data {D, U) to describe the model. A diplotype for an LD- 
block of SNPs is defined as a particular genotype combination of all the 
SNPs in that block. We seek to partition the L markers into B consecu- 
tive blocks, so that the number of observed diplotypes within each block is 
small (strong correlation), and correlation between SNPs in different blocks 
is weak. Block partitions can be quite ambiguous in many genomic regions 
and can vary across samples in details. Diplotype block structures obtained 
from available software are often based on ad hoc criteria that neither re- 
sult in a proper uncertainty measure nor optimize the association mapping 
power. 

The block variable B in our model consists of L binary indicators corre- 
sponding to the L SNPs in the data. An indicator is equal to 1 if the corre- 
sponding SNP is the start position of a block, and otherwise. As a result, B 
uniquely defines a partition of SNPs into consecutive blocks. For a diplotype 
block [a, b) consisting of SNPs a, . . . , 6 — 1, we let (n/^, m/j) denote the counts 
of a particular diplotype h in cases and controls, respectively. There are 3*~" 
possible diplotypes in the block. We assume that the diplotype of each indi- 
vidual follows independently from a multinomial distribution with frequency 
parameters {ph}, and {ph} follows a Dirichlet prior distribution, Dir({a/i}), 
where {oh} denotes a hyper-parameter (i.e., pseudo-counts). More precisely, 
letting {rih + nih} denote the combined counts of diplotype h observed in 
cases and controls, we have 

Pr(K + m,}|{p,}) = n Ph'^"^" Pr({p4) = Jj^'] J] Ph~'^ 

h=l llh=l ^{(^h) h=l 

where the subscript '•' denotes the sum of values over all subscripts. We can 
integrate out {ph} by 

Pr({n/, + mh}\{ph}) Pii{ph})d{{ph}) 

36-a 



Noting that the integrant on the right-hand side is proportional to the den- 
sity function of Dir(n/i + ruh + ah) up to its normalizing constant, we obtain 
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the marginal probability of the combined data of block [a, b) as 
P{Dia,b),Uia,b)\[a,b) forms a block) 

(1) 

/nb — a 

T{nh + mh + ah) \ r(a.) 



n 



T{ah) J T{n,+m, + a,) 

We set ah = p/3^~"' for diplotype h, and by default, we let p = 1.5. Based 
on the likelihood equivalence principle [Heckerman, Geiger and Chickering 
(1995)], {ah} is chosen to be inversely proportional to the total number of 
possible diplotypes in a block, and, hence, the sum a, remains a constant 
over different block sizes. Note that formula (1) can also be used to model ca- 
se data (-D) or control data {U) only, which is simply done by letting nih = 0, 
or Hh = 0, respectively, for all h. Further assuming independence between 
blocks (which is not entirely true, but serves as a good approximation), the 
probability function P{D,U\B) of genotypes of all blocks can be expressed 
as the product of individual block probabilities defined in formula (1). 

2.2. Bayesian inference of SNP association based on blocks. A saturated 
test of disease association for a diplotype block of Mz consecutive SNPs 
involves 3^ — 1 free parameters. When M is moderately large (M > 3), the 
power of the test becomes exceedingly low. We propose a Bayesian model 
of disease associations using only a subset of markers in the block. Here, we 
only discuss joint associations for SNPs within a diplotype block, and we 
will address epistatic interactions in the next section. It therefore suffices to 
describe our model for only one block. 

Let {M} denote the set of M SNPs in the block. We assume that only 
a subset {x} of SNPs of size x (often is or 1) are truly associated with 
the disease, and the SNPs in {M}\{a;} are not associated with the disease 
given {x}. Thus, the diplotypes of SNPs in {x} are distributed differently 
and hence are modeled by two different distributions for cases and controls, 
respectively. Conditional on the diplotypes of SNPs in {x}, the diplotypes of 
SNPs in {M}\{x}, however, follow a common distribution between cases and 
controls. The joint probability of the block data can therefore be expressed as 

(2) P{D{M},U{M}) = P{D{..})P{U{^})P{D{M}\{.},U{M}\{.} I^w, f/w). 

Here, P{D^,j.y) and -P(C^{x}) are modeled by two independent multinomial- 
Dirichlet distributions specified in formula (1), treating all SNPs in {x} 
jointly as a block. To model PiD^Mj\^^j ,U^M}\{x}\D{x}yU^x}) > we combine 
the diplotypes of SNPs in {Af }\{x} in cases and controls together. These 
diplotypes are not directly associated with the disease given {x}, and thus 
have the same conditional distributions between cases and controls. Condi- 
tional on each possible diplotype h of SNPs in {x}, we model the conditional 
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diplotype distribution of SNPs in {M}\{a;} again by a multinomial-Dirichlet 
distribution. It is straightforward to derive the expression 

P{D{M}\{x},Ui^M}\{x}\D{x},U[x}) = P{D^]^iy,U{M})/P{D{x},U[x}), 

where i^(-C){j\/}, f/{j\/}) and [/{^}) are specified in formula (1), treat- 

ing {M} and {x} as blocks, respectively. We note, however, that the marker 
set {x} is unknown a priori, and needs to be inferred jointly with other 
parameters in our Bayesian model. 

3. Joint inference of diplotype blocks and disease association. 

3.1. The joint model. To further incorporate epistatic interactions in for- 
mula (2), and to identify which SNPs are associated with the disease (re- 
sponse), we partition all SNPs (not blocks) into three groups as in BEAM 
[Zhang and Liu (2007)]. We introduce a latent L-dimensional indicator vari- 
able (/) to represent the group memberships of the L markers. For each 
marker i, li = 0,1,2 denotes three possible group memberships. SNPs be- 
longing to group-2 are assumed to be jointly associated with the disease, 
that is, epistasis, which are modeled by two joint multinomial distributions 
on the diplotypes over all group-2 SNPs — one for cases and one controls. 
SNPs belonging to group- 1 are assumed to be marginally associated with 
the disease if they belong to different blocks, and are modeled by mutually 
independent multinomial distributions conditional on the case-control sta- 
tus and the block structure. If multiple group-1 SNPs fall into one block, 
we model their diplotypes jointly, that is, group-1 SNPs within blocks be- 
come dependent of each other. If there are both group-1 and group-2 SNPs 
within one block, we model the diplotypes of group-1 SNPs within the block 
conditional on the diplotypes of the group-2 SNPs within the block. SNPs 
belonging to group-0 are the remaining SNPs unrelated to the disease status. 
We again model the distribution of group-0 SNPs within a block by multi- 
nomial distributions, with common parameters for cases and controls. We 
further assume conditional independence of group-0 SNPs between blocks, 
conditional on the group-1 and group-2 SNPs. More precisely, within each 
block, we let {X2} denote the set of group-2 SNPs, let {x} denote the union 
of the group-1 and group-2 SNPs, and let {M} denote all SNPs. We revise 
formula (2) to take the form of a conditional probability function: 

PiD{M}\{x2},U{M}\{x2}\D{x2},U{x2}) 

_ P{D{x})P{U{x})P{D{M}\{x},U{M}\{x}\D{x},U{x}) 
P(I)|,,|)P(f/|,,|) 

Thus, group-0 and group-1 SNPs are no longer mutually independent as in 
BEAMl, but are related to each other via the block structure. With epistasis 
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considered, the mutual independence between blocks in model (2) becomes 
conditional independence given group-2 SNPs. For notational simplicity, we 
omit variable / in (3), but both {x} and {2:2} are determined by I. 

Given a particular block partition B and SNP group memberships /, we 
express the joint probability function of the entire case-control data as 

P{D,U\B,I) = P{D2\B,I)P{U2\B,I) 

(4) 

X JJ^ P{D{M}\{x2}^U{M}\{x2}\D{x2},Ui^^^},B,I), 
{M}=[a,b)eB 

where D2 and U2 denote the case and control genotypes of group-2 SNPs, 
respectively, and the product term is defined in formula (3). 

3.2. Choice of prior distributions. We set the prior distribution of the 
block variable B as the product of independent Bernoulli probabilities P{B) = 
p\^\[\ — where \B\ denotes the sum of indicators in B. According 

to the block distributions estimated in European and Asian populations by 
Gabriel et al. (2002), we assume that there are 50,000 blocks in the human 
genome a priori, and thus we set p = min(0.5, 50,000i?/ (3 x lO^L)). Here, R 
denotes the length of the region spanned by the L SNPs, and 3 x 10^ is 
the length of the human genome. A smaller value of p will help the method 
identify larger blocks, and a larger p will tend to identify smaller blocks. 
As the sample size (number of individuals) increases, however, the impact 
of the prior choices diminishes quickly. To avoid overfitting the blocks, we 
further impose a restriction that the maximum number of observed distinct 
diplotypes in a block must be smaller than [N^ + Nu)/^^- 

We set the prior distribution of the SNP membership variable / as a prod- 
uct of independent multinomial distributions, P{I) = ni=o^'l^"'^^~*^' ' '^^ere 
{P0iPi-,P2} denote the prior probability of each SNP belonging to group 
0, 1, and 2, respectively. By default, we set pi = P2 = min(0.1, 5/L), and 
Pq = 1 — pi — P2. That is, we assume there are 10 SNPs associated with the 
disease a priori, where 5 are marginally associated with the disease, and 5 are 
associated through epistasis. Our choice of the prior reflects that there are 
just a few SNPs truly associated with the disease in a GWA study (where 
many other significant SNPs are due to LD effects). Increasing this prior 
(and also increasing the significance level) in the BEAM2 program may help 
identify additional SNPs of moderate to low effects. To avoid overfitting in 
interaction mapping, we further set an upper bound to the order of interac- 
tions by \n2({Nd + Nu)/10). For example, when the sample size is 1,000, our 
method can detect up to 4-way interactions. Overall, changing the values 
of pi,P2 may affect the posterior distribution of SNPs in groups 1 and 2, 
but the effects will diminish as the sample size increases. 
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Finally, the joint model of the observed genotype data in cases (D) and 
controls (C/), the block variable (B), and the SNP membership (I), is writ- 
ten as 

(5) P{D, U, B, I) = P{D, U\B, I)P{B)P{I), 

where the conditional distribution of {D,U) given {B,I) is specified in for- 
mula (4). 

3.3. MCMC updates. The parameters of interest in our model are the 
block partition B and SNP membership /. We develop Metropolis-Hastings 
(MH) algorithms [Liu (2001)] to update B, and, simultaneously, we develop 
a mix of a Gibbs sampler and Metropolis-Hastings algorithm to update /. 
The posterior distribution of {B,I) is then output for further analysis. 

To explore all possible block partitions, we propose the following MH-mo- 
ves: given a current block configuration B, we randomly select a block and 

(1) divide the block into two new blocks at a random position; 

(2) merge two adjacent blocks into one block; and 

(3) randomly shift a block boundary to either left or right by k SNPs, 
where the shifting amount is constrained by other block boundaries. 

The proposed move produces a new block partition B' , and the move is 
accepted with probability r = min{l, ^pljfij^i%)q\B^B') where q{B B') 
denotes the probability of updating from B to B' , and P{D,U,B,I) is cal- 
culated from the full model (5). In our implementation, we chose the three 
types of MH moves with probabilities 0.1, 0.1, and 0.8, respectively, and we 
require a block to contain at least one SNP. 

To update the SNP membership variable I, we updated the membership /j 
of SNP i by calculating the posterior distribution of Jj = 0,1,2 given all other 
model parameters and the data. We also propose a MH-move to switch 
the group memberships of two SNPs and accept the move based on MH- 
ratios. Per MCMC iteration, we first run the Gibbs sampler to update the 
memberships of all SNPs once, and then we run the MH-sampler to switch 
each SNP in group-1 and group-2 once with SNPs in other groups. 

4. Follow-up tests and generalization of the method. 

4.1. A test of significance based on the B ayes factor. Although inference 
can be directly made from the posterior probabilities output by BEAM2, the 
users may want to further evaluate the statistical significance of the results 
in a frequentist way. In BEAM [Zhang and Liu (2007)], we developed a novel 
Bayes factor, called B-stat, to evaluate whether a SNP or a set of SNPs are 
significantly associated with the disease, where the SNP set is selected by 
BEAM2 in our case. 
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For a set of M SNPs to be tested, the null hypothesis is that all M SNPs 
are not associated with the disease. Here, M = 1,2,3, .. . represents single- 
SNP, 2- way, and 3- way interactions, etc. B-stat for the set of M SNPs is 
defined as 



Here, Pq{Dm-,Um) denotes the null genotype distribution (i.e., no disease 
association) at the M SNPs in cases and controls, and Pa{Dm, Um) denotes 
the alternative genotype distribution (disease association). Under the null 
model, we assume that the genotypes in both cases and controls follow the 
same distribution, whereas under the alternative model, they follow different 
distributions. We choose both Pq{Dm,Um) and Pa{Dm,Um) as an equal 
mixture of two distributions: one that assumes independence among the M 
SNPs in controls (and also in cases under the null model), which yields the 
product terms in formula (6), and the other that assumes a saturated joint 
distribution of all the M SNPs. Note that the form of each term in formu- 
la (6) is defined in formula (1). 

An interesting feature of B-stat is that it uses a mixture model to ac- 
commodate the possibility that the M SNPs may or may not be in linkage 
equilibrium (independence). As a result, using B-stat will be more powerful 
than using a standard likelihood ratio test or a chi-square test of associations 
when the M SNPs under testing are in LD in controls. 

We have previously shown that, under the null hypothesis of no disease 
association, B-stat follows asymptotically a shifted chi-square distribution 
with 3'^ — 1 degrees of freedom [Zhang and Liu (2007)]. The shifting pa- 
rameter can be computed explicitly, which is determined by the sample 
size (NcijNu), the interaction size M, and the Dirichlet hyper-parameter 
{ah}- Briefly speaking, the shifting parameter is proportional to —(3*^ — 
1) lii{N(iNu/{Nd + Nu)), and, thus, the larger the number of individuals col- 
lected, or the more SNPs involved in an interaction, the smaller the shifting 
parameter will be. In addition, if large hyper-parameters {at} for the diplo- 
type frequency parameters are used, the shifting parameter will be large 
too. Note that we want the B-stat to be small (e.g., <0) when the M SNPs 
are not associated with the disease, and, hence, the users should use small 
values for {a^}, such as the default values we used in our model. 

4.2. Generalization to classification problems with discrete covariates. Let 
Y be the n x 1 binary response vector, and let X = (Xi, . . . , Xp) be the nxp 
covariates matrix, with each covariate Xj taking on kj discrete (ordinal or 
categorical) values. The standard case-control genetic study setting can be 
viewed as using response variables Y (i.e., case-control status) to fish out 



(6) 



Bm = In 



Pa{Dm,Um) 
Po{Dm,Um) 



= ln 



p{DAi)[p{UAi)+njeMnm 

P{DM,UM) + Y\,eMPiDj,U,)- 
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relevant predictors Xj (i.e., SNPs). The epistasis mapping methods BEAM 
attempt to find those X's that interactively affect Y. Both BEAM and the 
block-based method BEAM2 can be easily extended to infer a classification 
model. 

The idea of both BEAM [Zhang and Liu (2007)] and BEAM2 is to parti- 
tion the p covariates in X into three nonover lapping groups, such that one 
group contains covariates unrelated with Y, and the other groups contain 
covariates either independently or jointly related with Y. The partition of 
the covariates is an unobserved latent structure. Given a particular group 
partition of the covariates /, we can compute P(X|Y,7) as in Zhang and 
Liu (2007), which is analogous to that in a naive Bayes model. BEAM2 fur- 
ther segments the covariates X into "blocks" of highly correlated variables, 
and treats blocks as mutually independent. This is achieved by introducing 
a block indicator variable B, which is updated iteratively together with the 
variable selection indicator /. 

To predict the classification of a new observation with covariates x^^ew 
based on the training data (X, Y), we compute -P(X,x„e«j|Y,y„e«j = 1) and 
P(X,x„e„|Y,y„e^ = 0), respectively, using the BEAM (or BEAM2) algo- 
rithm, and obtain the odds ratio 

P{^new |X, Y, Ynew — l)/-f (^new |X, Y, Ynew — 0) . 

The prior P{ynew = 1) can be estimated from the prior knowledge of class 
distribution, such as the prevalence of a particular disease in the popula- 
tion, which then leads to the posterior predictive probability for ynew = 1- 
A computationally more attractive way to do the computation is to output 
the latent variable partition and block partition structures (/ and B in our 
case) from their joint posterior distribution inferred by the MCMC proce- 
dure of BEAM, and then average the conditional odds over all the sampled / 
and B : 

P(x„e«,|X, Y,/,S,y„e«; = 1)/P(x„e^ |X, Y, /, y„e«; =0). 

We tested this latter approach in a preliminary study and found the results 
quite satisfactory. 

The effect of BEAM2 is somewhat analogous to that of elastic net [Zou 
and Hastie (2005)] and group lasso [Yuan and Lin (2006)]. All methods at- 
tempt to address the phenomena that groups of covariates tend to demon- 
strate associations with the response together, and within groups the covari- 
ates are highly correlated. Different from elastic net and group lasso, BEAM2 
infers the covariate groups and also the informative covariates within groups 
jointly in a coherent probability framework. As a consequence, BEAM2 al- 
lows sparse variable selection at both the group level and the individual 
variable level within groups, whereas elastic net and group lasso do sparse 
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selection only at the group level. In a recent technical report, Friedman, 
Hastie and Tibshirani (2010) attempt to achieve a similar sparse selection 
effect as BEAM2 (sparse selection at both group and individual levels) by 
introducing an additional penalty term. 

Other important distinctions between BEAM2 (or BEAM) and those 
lasso-based methods are the following: (a) the use of the naive Bayes frame- 
work to model X given Y to greatly alleviate the overfitting problem; (b) the 
ability to incorporate interaction terms without incurring a huge computa- 
tional burden (with MCMC iterations) ; and (c) the adoption of the Bayesian 
variable selection principle, which is equivalent to using a more desirable Lq 
penalty. The cost of these advantages is that both BEAM and BEAM2 
have to compute via MCMC without a guarantee of always finding the opti- 
mal solution. Empirically, however, the computational speed of BEAM and 
BEAM2 is no worse than that required by lasso-type algorithms when the 
number of covariates is large. 

5. Simulation studies and algorithm comparisons. 

5.1. Block partition of HLA data. We first used the HLA region on hu- 
man chromosome 6 to evaluate the block partitions inferred by our method. 
The HLA region is one of the few regions in the human genome in which 
recombination hotspots have been experimentally verified [Jeffreys, Ritchie 
and Neumann (2000); Jeffreys, Kauppi and Neumann (2001)]. We down- 
loaded the genotype data of 50 unrelated UK Caucasian semen donors from 
Jeffreys AJ's website. The data covers a 216 kb region with 296 genotyped 
biallelic markers spanning from the upstream of gene HLA-DNA to gene 
TAP2 in the MHC Class II region. It is known that this region contains 
several prominent recombination hotspots [Jeffreys, Ritchie and Neumann 
(2000); Jeffreys, Kauppi and Neumann (2001)]. We therefore examined the 
relationship between the experimentally verified recombination hotspots and 
the SNP-block boundaries inferred by BEAM2. We used both haplotypes 
and genotypes to evaluate our method, where the HLA haplotypes were 
first inferred by CHB [Zhang, Niu and Liu (2006)] from the genotype data. 
We also simulated 1,000 individuals from the inferred HLA haplotypes using 
HAPGEN [Marchini et al. (2007)] to evaluate the performance of BEAM2 
with a larger sample size. As a comparison, we applied HapBlock [Zhang et 
al. (2002b)], a dynamic programming-based algorithm of block partitioning, 
to the same sets of data. 

With 100 haplotypes inferred from the 50 individuals, BEAM2 produced 
accurate block partitions that correspond well to the visual blocks displayed 
by Haploview [Barrett et al. (2005)]. The block boundaries also coincide 
with the known recombination hotspots within the HLA region (Figure 1). 
It is further observed that, for the haplotype data, the blocks inferred by 



BLOCK-BASED BAYESIAN EPISTASIS ASSOCIATION MAPPING 



13 



(0 

g. BM 
S HB1 

I" HB2 
o 



\V^V .-'1 



. 4-y 



HB3 



« BM 

c 

O 



S 



HB1 
HB2 
HB3 



g. BM 
■5 HB1 
I HB2' 

o 
o 
o 



HB3 



50 



100 



150 



200 



250 



50 



100 



150 



200 



250 



— 1 — 

50 



100 



150 



200 



250 





J J II 1 .IJ 1 




1 


1 ,.J , . 


1 1 






,;1 




\ 


J„ J 










• 1 














# 
















1 






1 




f 






; 




\ 














;.[ 















300 





I.IL 




Ji 


y 




LULk 


31, 




II 


i, 




1 II 


1. 




III; 






1 1 










1 










































{; 








W 
















1 






III 






\ 


II 


II 


1: 






III 




lip 


l|li 






i 








I 


- 1 

r 






r 






nil... 


f 
















[1 


•\ 




1 I'llhl 


1 



300 




4 
2 

-2 



300 



Fig. 1. Block partition results on the HLA data. The top panel shows the pairwise SNP 
LD calculated from 100 HLA haplotypes by Haploview. The second, third, and forth panels 
show the block partition results by BEAM2 (BM) and HapBlock (HBl, HB2, HB3) on 
three HLA data sets, respectively. HBl uses rule 1: common haplotypes; HB2 uses rule 2: 
pairwise D' ; and HB3 uses rule 3: four-gamete test. Experimentally verified recombination 
rates are shown as dashed red lines (the unit cM/Mb in the natural logarithm scale is shown 
on the right). 



BEAM2 are very similar to those obtained by HapBlock. Unlike our model- 
based method, HapBlock requires the user to specify ad hoc block partition 
rules, which can result in undesirable partitions. We used three different 
rules to define blocks: (1) common haplotypes, defined as a haplotype >5% 
in the sample, cover 80% of samples in a block; (2) at least 80% SNP pairs 
with D' > 0.5 in a block; or (3) four-gamete test on common haplotypes 
(>5%) in a block. The blocks partitioned by HapBlock using each rule are 
also shown in Figure 1. 

Using the genotype data of the 50 individuals, we obtained very different 
results between BEAM2 and HapBlock, and between the three different 
rules of block partitions. Except for the first rule of HapBlock, all other 
methods produced a large number of small blocks. Small blocks generated 
by BEAM2 are due to the small sample size of 50 individuals, based on which 
the correlation between SNPs is hard to detect using our likelihood model. 
The posterior probabilities of block boundaries output by BEAM2, however, 
can be used as a measure of uncertainty in block partitions. In comparison. 
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Table 1 

Number of blocks inferred m HLA data sets 



Data 


BEAM2 


BEAM2-10p 


HapBlock-1 


HapBlock-2 


HapBlock-3 


100 haplotypes 


32 


39 


20 


30 


34 


50 genotypes 


81 


87 


46 


147 


106 


1,000 genotypes 


35 


36 


56 


184 


268 



"BEAM2-10p" denotes BEAM2 applying a 10 times larger prior probability than the 
default prior on the block boundary variable. HapBlock-1, 2, 3 denotes HapBlock applying 
three different block partition rules. 



HapBlock and many other block partitioning methods only provide a single 
partition solution without measuring block uncertainty. 

Using the simulated genotypes of 1,000 individuals, we again observed 
very different results shown in Figure 1. BEAM2 produced the cleanest 
block partitions that corresponded well to the visual block boundaries and 
to the known recombination hotspots. The D' rule and the four-gamete test 
rule via HapBlock again failed to produce reasonable partitions, of which 
most blocks were singletons. 

We further show in Table 1 the number of blocks inferred by each method 
in the three data sets. We observed that BEAM2 performed well (by which 
we roughly mean that the number of estimated blocks is small, as is true in 
the HLA region) in both the haplotype data and the 1,000 individuals' geno- 
type data. Because modeling genotypes (diplotypes) requires a much larger 
set of parameters than modeling haplotypes, BEAM2 is expected to perform 
worse in the 50 individuals' genotype data. As the number of individuals in- 
creased to 1,000, however, our model-based approach produced very similar 
partitions as that obtained in the haplotype data. In comparison, HapBlock 
only preformed reasonably well in the haplotype data, but produced many 
small blocks and singletons in the other two data sets for all three block par- 
tition rules applied. HapBlock performed the worst in the 1,000 individuals' 
genotype data, indicating that the ad hoc rules applied by HapBlock do not 
produce consistent block partitions as sample size increases. We further show 
in Table 1 additional results by BEAM2 using a 10 times larger prior on the 
block boundary variable, that is, we expect 10 times more blocks a priori. 
We observed that the estimated posterior number of blocks did not increase 
much, particularly in the 1,000 individuals' genotype data, indicating that 
BEAM2 is insensitive to the prior choice of block boundary variables. We 
also ran multiple MCMC chains to ensure proper convergence. 

5.2. Simulation study using HapMap data. To mimic real genetic data 
observed in human populations, we first randomly select a region in the hu- 
man genome that contains 1,000 Illumina HapMap 300k tagSNPs. The re- 
gion also contains about the same number of additional SNPs from HapMap 
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Phasell tagged by these tagSNPs, which we refer to as nontagging SNPs. 
Two nontagging SNPs in the region are randomly selected as disease SNPs. 
Given a disease model, we set the marginal effect size (log odds ratio mi- 
nus 1) per disease SNP at 0.5 and choose a disease minor allele frequency 
(MAF) per locus from (0.05,0.10,0.20,0.50). Given a marginal effect size 
and a choice of MAF, we then calculate the diplotype frequencies over the 
disease SNPs in cases and controls, respectively [this is similarly done as 
presented in Zhang and Liu (2007)]. According to the case-control diplo- 
type frequencies over the disease SNPs, we randomly sample 1,000 cases 
and 1,000 controls from a pool of individuals without replacement. The 
pool consists of 10,000 control individuals generated by HAPGEN [Mar- 
chini et al. (2007)] using HapMap European sample (parents only) at odds 
ratio = 1, that is, no disease association. Our simulation procedure is more 
economical than a direct approach that generates one individual at a time 
and determines its disease status conditional on the disease genotypes and 
penetrance, because the direct approach may generate many more controls 
before obtaining enough cases. Finally, we remove all nontagging SNPs from 
the data including the two disease SNPs (which are typically unobserved in 
a GWA study), and obtain a case-control data set containing 1,000 Illumina 
HapMap tagSNPs. 

To evaluate the association mapping performance of our method, we sim- 
ulated case-control data sets based on the HapMap sample under three 
disease models shown in Table 2. Each disease model assumes 2 loci in the 



Table 2 

Disease models used in simulation study 



Risk 


A/ A 


A/a 


a/a 


Model 1 








B/B 


1 


1 + 9 


{1 + ef 


B/b 


i + e 


{1 + ef 


(1 + 9)^ 
{1 + 9^ 


b/b 


{i + ef 


{1 + ef 


Model 2 








B/B 


1 


1 


1 


B/b 


1 


{i + ef 


(1 + 9)^ 


b/b 


1 


{i + ef 


(1 + 9)^ 


Model 3 








B/B 


1 


1 


1 


B/b 


1 


i + e 


1 + 9 


b/b 


1 


1 + 9 


1 + 9 



Each table cell lists the relative risk of the corresponding genotype com- 
bination. Cenotypes with risks equal to 1 have no effects to the disease. 
The parameter 9 is computed according to the specified marginal effects 
(0.5 in our simulation) and disease MAFs (0.05,0.1,0.2,0.5). 
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genome contributing to the disease risk. While the first model assumes no 
interactions, the other two models assume different types of interactions. 
Using the simulated data sets, we compared the performance of BEAM2 to 
BEAMl. We also implemented a third method that maps associations and 
interactions based on predetermined block structures. This third method 
serves as an intermediate method between BEAMl, which is not block- 
based, and BEAM2, which infers block structures and maps associations 
simultaneously. We used three different levels of parameters (from stringent 
to liberal) to define blocks using existing software, and we treated the diplo- 
types within each inferred block as genetic alleles. The third method using 
three different block definitions are hereafter referred to as Blockl, Block2, 
and Blocks, respectively [more details of the third method can be found in 
the Supplementary Material, Zhang, Zhang and Liu (2011)]. To compare the 
performance of all methods, we ranked SNPs according to the association 
posterior probabilities output by each method estimated for each data set. 
We then calculated how often a method ranked the disease related SNPs 
among the top SNPs. A SNP is regarded as being correctly identified as 
disease related if it is within 5 SNPs on either side of a true disease locus. 

As shown in Figure 2, under all parameter settings, BEAM2 performed the 
best among all tested methods, where Blockl, Block2, Blocks, and BEAMl 
all performed similarly. When disease allele frequency was low (/ = 0.05), 
the power curves of all methods looked similar, but a closer inspection of the 
top 5 ranked SNPs showed that BEAMl only had ~50% chance to capture 
disease related SNPs relative to BEAM2. When disease alleles were common 
in the population (/ = 0.10, 0.20, 0.50), the advantage of the BEAM2 model 
becomes obvious. Comparing the power curves for Model 2 and Model 3, 
we observed that the power of BEAM2 increased much faster than that of 
BEAMl among the top 2 or 3 SNPs. We did not observe this behavior in 
Model 1, which has no interactions. It thus indicates that using SNP-blocks 
can increase the power of mapping both single SNP and multi-SNP inter- 
action associations. All methods compared here are Bayesian methods that 
output posterior probabilities of disease associations. It therefore indicates 
that our treatment of LD in BEAM2 is more appropriate than using either 
predetermined blocks or a Markov chain model (as in BEAMl). 

To further declare statistical significance, existing significance estimation 
methods adjusting for multiple comparisons should be used, such as the 
Bonferroni correction applied to B-stat introduced in BEAMl [Zhang and 
Liu (2007)]. We compared BEAM2 with single SNP chi-square tests using 
the above disease models [Supplementary Table SI, Zhang, Zhang and Liu 
(2011)], and observed that BEAM2 performed better than the chi-square 
test for interaction Model 2 and Model 3, but performed the same for the 
noninteractive Model 1. 



BLOCK-BASED BAYESIAN EPISTASIS ASSOCIATION MAPPING 17 
Modei 1 Model 2 Mndet 3 



d 



o 
d 






CD 

d 



i_ o 
CL 




1 — I — I — [ — I — r 




"1 — I — I — I — r 




1 — I — I — r 



d 



q 
d 




1 1 r 




T r 



1 r 



q 
d 




1 — I — I — I — I — r 

20 40 60 80 100 
# of top SNPs 




1 1 1 1 1 

20 40 60 80 100 
# ol top SNPs 



T — I r 



20 40 60 80 
» o( top SNPs 



100 



Fig. 2. Power comparison for BEAMl, BEAM2, Blockl, Block2, and Blocks using sim- 
ulated data from the HapMap data. Under each simulation setting and from 50 data sets, 
power (y-axis) is calculated as the proportion of disease-associated SNPs (within 5 SNPs 
of true disease loci) among top m SNPs (x-axis), ranked by the posterior probability of as- 
sociation. Each data set contains 1,000 candidate SNPs in 1,000 cases and 1,000 controls. 
The disease allele frequency is 0.05, 0.10, 0.20, and 0.50, respectively. The marginal effect 
size of each disease SNP (unobserved) is 0.5. Pink: BEAMl; Red: BEAM2; Blue: Blockl; 
Green: Block2; Black: Blocks. 



We also checked the performance of our MCMC samphng algorithm. As 
shown in Figure 3, using a simulated data set from disease Model 2, the 
lag of autocorrelation of our Markov chain is short, indicating fast conver- 
gence of the Markov chain. We further compared the posterior distribution 
of SNP associations from 4 independent runs of BEAM2, and we observed 
close agreement between runs. In practice, the Markov chain could converge 
to local modes, particularly if the data contain many SNPs with complicated 
block structures. If block structures are of primary interest, we suggest run- 
ning BEAM2 in several runs to check if the block partition results obtained 
in different runs are consistent. More advanced MCMC algorithms, such 
as parallel tempering [Liu (2001)], could further alleviate the local mode 
problems in MCMC sampling. 

Using BEAM2, we can estimate the number of disease associated SNPs 
around a disease locus by the sum of posterior probabilities of associations 
over all SNPs within a neighborhood of a candidate locus. Given our block- 
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Fig. 3. Performance of MCMC. (a) Autocorrelation plot obtained by running BEAM2 
on a simulated data set of 1,000 cases, 1,000 controls, and 1,000 SNPs under disease Mo- 
del 2. A total of 1,000 iterations after bum-in are used to calculate the plot, (b) Posterior 
probabilities (in logarithm scale) of disease association (marginal and epistatic) per SNP 
compared across 4 independent runs of BEAM2. 



based association model, the number of associated SNPs does not include 
SNPs whose disease association is merely created by LD, and, hence, our es- 
timates are more appropriate than a naive count of significant SNPs within 
the neighborhood. As shown in Figure 4, around a 100-kb neighborhood 
of every simulated disease locus in disease Model 1, the estimated number 
of disease associated SNPs by BEAM2 is around 1 when the association 
signal is sufficiently strong, even if there are many significant SNPs in the 
neighborhood. The extra significant SNPs created by LD make the local- 
ization of disease locus difficult. This result highlights the importance of 
BEAM2 that performs automatic variable selection within blocks. Rather 
than reporting diluted small posterior probability of association over many 
neighboring SNPs in LD, BEAM2 was able to select the strongest contribut- 
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(a) (b) 

Fig. 4. Estimated number of disease associated SNPs (y-axis) plotted against (a) the 
number of significant SNPs (>0.05 family-wise) and (b) the maximum single SNP 
chi-square statistics (x-axis) within a 100 kb neighborhood per disease locus. In (a), the 
number of loci in each box plot is further shown on the top. The plots are computed from 
200 simulated data sets of disease Model 1. 

ing SNP within blocks (with large posterior probabilities of association) in 
our simulation study. 

6. Application to WTCCC type 1 diabetes data. We apphed BEAM2 
to analyze the TID data set from the WTCCC project [WTCCC (2007)]. 
The data set contains 2,000 TID patients, 1,504 controls from 1958 Birth 
Cohort (58C), and 1,500 additional controls from the National Blood Service 
(NBS). Given our limited computation resources (computation time, which 
would require several days to analyze half a million SNPs in this data set; 
and memory usage, which would require >4 Gb for half million SNPs), we 
applied BEAM2 to the top 10% SNPs ranked by marginal associations with 
TID on all autosomes. We further filtered out SNPs with bad clustering, 
SNPs violating Hardy- Weinberg equilibrium in controls at 10~^ significance, 
and "almost nonpolymorphic" SNPs of which >95% samples have the same 
genotypes. The final data set contained 42,470 SNPs in 5,004 individuals. 
We ran BEAM2 with 5 independent MCMC chains. Figure 5 shows the 
averaged posterior probabilities of TID associations. Note that selecting top 
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Fig. 5. SNP-wise posterior probabilities (y-axis) of TID associations m 22 autosomal 
chromosomes (x-axis, chromosomes are separated by grey dashed lines). Previously re- 
ported TID associated genes are highlighted in green, and candidate TID associated genes, 
as defined at TlDbase, are highlighted in red. 
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10% SNPs does not imply sparse and less correlated SNPs, because SNPs 
in LD tend to be in or out of the top 10% list together. Two alternative 
ways to reduce the number of SNPs are to run BEAM2 on each individual 
chromosome for chromosome- wise epistasis, or on previously detected disease 
associated regions. 

6.1. Result summary and highlights. We first compared our results with 
the SNPs reported in the original WTCCC paper (2007). As expected, we 
found that the SNPs reported by BEAM2 and by the original WTCCC anal- 
ysis are highly consistent. All strongly associated SNPs reported in WTCCC 
are significant in our analysis, and all strongly and moderately associated 
SNPs reported in WTCCC have posterior probability > 0.1 by BEAM2 
[Supplementary Table S2, Zhang, Zhang and Liu (2011)]. We further com- 
pared our results with known TID associations obtained from TlDbase 
(www.tldbase.org). Among the 55 SNPs (or cluster of SNPs) output by 
BEAM2 with posterior probabilities greater than 0.1, 17 (31%) overlapped 
with known TID associated regions, including some well-known genes such 
as PTPN22 (lpl3), CTLA4 (2q33), MHC (6p21), and IL2RA (10pl5). 

In addition to the previously reported TID genes, BEAM2 reported some 
novel TID associated loci. A list of likely TID associations detected by 
BEAM2, for both single SNPs (if j5-value < 5e— 7) and two-way joint asso- 
ciations (if value < 5e— 10), is shown in Tables 3 and 4, respectively. For 
example, we detected 7 loci, among which two SNPs in short distance form 
strong joint associations with TID (p- value < 5e— 10). These loci are not 
identifiable using single SNP tests, but captured by BEAM2 as multi-SNP 



Table 3 

Strong single SNP associations with TID 



SNP 




Position 


p- value 


TlDbase 


Gene 


rs6679677* 


chrl: 


113.8-114.2 Mb 





Yes 


PTPN22 


rs9405484 


chr6: 


1.4 Mb 


1.33e-8 


No 


FOXCl 


MHC* 


chr6: 


25-35 Mb 





Yes 


MHC 


rs6592988 


chr7: 


52.1 Mb 


2.87e-7 


Yes 


COBL 


rsll984645 


chr8: 


55.2 Mb 


7.06e-ll 


No 


MRPL15 


rsll782342 


chr8: 


73.9 Mb 


5.70e-12 


No 


KCNB2 


rsll052552 


clirl2: 


9.7 Mb 


2.61e-7 


Yes 


CLEC2D 


rslll71739* 


chrl2: 


54.8 Mb 


2.35e-ll 


Yes 


ERBB3 


rsl7696736* 


chrl2: 


109.8-110.9 Mb 





Yes 


C CDC 63, NAPl 


rsl2924729* 


chrl6: 


11.1 Mb 


l.Ole-7 


Yes 


CLEC16A 



SNPs showing strong associations (p-value < 5e— 7) with TID by single SNP test, p-value: 
nominal p-value of associations. TlDbase: whether the locus is documented in TlDbase. 
Gene: nearest gene. 

'Additional SNPs in its neighborhood also show strong marginal associations. 
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Table 4 

Strong two-SNP associations with TID 



SNPl 


Posl 


Pvall 


Gene 


SNP2 


Pos2 


PvaI2 


Gene 


JointP 


rs7525703 


chrl: 143 


2e- 


2 


PRKAB2 


rs2077749 


chrl 


143 


3e- 


-6 


PRKAB2 


2e 


-12 


rs6809441 


chr3: 41 


4e- 


2 


ULK4 


N/A 


chr3 


41 


7e- 


2 


ULK4 


5e 


-11 


rs330483 


chr4: 176 


4e- 


6 


ADAM29 


rs330504 


chr4 


176 


le- 


1 


ADAM29 


2e 


-11 


rs6906469 


chr6: 10 


le- 


3 


DFCCl 


rs659964* 


chrl2 


111 


2e- 


6 


ACAD 10 


4e 


-11 


rs3132676* 


chr6: 30 


le- 


5 


TRIM 40 


rs9376523 


chr6 


141 


6e- 


'4 


TXLNB 


3e 


-11 


rs9296661 


chr6: 52 


4e- 


4 


PKHDl 


rsl265566* 


chrl 2 


110 


le- 


'6 


CUTL2 


5e 


-11 


rsl3340508 


chr7: 75 


9e- 


3 


CCL24 


rsl7361077 


chr7 


75 


4e- 


-5 


CCL24 


6e 


-12 


rs4838140 


chr9: 124 


2e- 


5 


NEK6 


rs7860360 


chr9 


125 


6e- 


-4 


SCAI 


4e 


-10 


rslll04868 


chrl2: 87 


le- 


4 


KITLG 


rs7961663* 


chrl 2 


110 


3e- 


-6 


CUTL2 


le 


-11 


rsl958305 


chrl4: 23 


5e- 


5 


DHRS2 


rsl2100601 


chrl 4 


23 


2e- 


-2 


DHRS2 







rs7262414 


chr20: 40 


2e- 


5 


PTPRT 


rs2867064 


chr20 


40 


7e- 


-2 


PTPRT 








Pairs of SNPs showing strong joint TID associations (p- value < 5e— 10) by B-stat. If 
multiple SNP pairs are located around the same loci, only one pair is shown. Pvall, 
Pval2, JointP represent nominal p-values of SNPl, SNP2, and their joint associations, 
respectively. 

*SNP lies in known regions in TlDbase. 



associations. We also found some likely long-distance and cross-chromosomal 
interaction associations with TID (p- value < 5e— 11). One example is the 
joint association between SNP rs3132676 in the classic MHC region on chro- 
mosome 6 and SNP rs9376523, which is 111 Mb away on the same chromo- 
some. This SNP pair is likely interacting because their genotypes are strongly 
correlated in cases (nominal p- value 8e— 6 by test of independence), but not 
in controls (nominal p- value 0.91). Although most two-way associations did 
not pass the Bonferroni adjusted significance level in the genome scale, the 
short-distance two-way associations are significant if only considering local 
joint associations. 

We further examined possible confounding effects of population structures 
in the TID data using a logistic regression model. The regional information 
of WTCCC individuals is included as dummy covariates. We observed that 
the test statistics of the detected SNP associations remained almost un- 
changed before and after the adjustment of population origins. We further 
randomly selected 10,000 SNPs genome-wide and compared the distribution 
of their association statistics with a chi-square distribution. The two distri- 
butions agreed well [see Supplementary Figure SI, Zhang, Zhang and Liu 
(2011)]. We therefore believe that population structure does not incur false 
positive associations in the WTCCC TID data. 

We finally checked the block partition results of BEAM2 on the TID data. 
Given the large number of SNPs, we cannot visually inspect the blocks as 
we did for the HLA data. Alternatively, we computed the genetic distance 
between adjacent SNP pairs in the TID data, using a genetic map con- 
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Fig. 6. Frequency (y-axis) of block boundaries between adjacent SNP pairs of certain 
genetic distance (x-axis), calculated from the TID data. The genetic distance is obtained 
from the HapMap CEU sample. 

structed from the HapMap CEU sample. Then, we checked how frequent 
a block boundary is inferred between SNP pairs in certain genetic distance. 
Intuitively, the more distant two SNPs are located, the more likely a block 
boundary occurs. As shown in Figure 6, our method inferred almost 100% 
block boundaries between SNP pairs with genetic distance >1 cM, and the 
frequency of block boundary decreases as the SNP pairs get closer. The 
block partitions between the 5 runs of BEAM2 are consistent, with an av- 
erage correlation coefficient of 0.96. 

6.2. Joint association patterns in the MHC region. TID is an autoim- 
mune disease and genes in the MHC region play an important role in the 
immune system and autoimmunity [Nejentsev et al. (2007); Steenkiste et 
al. (2007)]. BEAM2 found a large number of SNPs within the MHC region 
showing extremely strong association signals with TID. Several multi-locus 
joint associations within the MHC region are also detected. We therefore 
examined more closely a 10-Mb MHC region, including the extended MHC 
region (25-32 Mb) and the classic MHC region (32-35 Mb). 

Within this 10 Mb region, we observed that the SNP pairs associated 
with TID are more often strongly correlated in cases than in controls [see 
Supplementary Figure S2, Zhang, Zhang and Liu (2011)]. The joint associ- 
ations spanned from the classic MHC region to the extended MHC region 
over a distance as long as 6.5 Mb. It has been previously reported that hap- 
lotype blocks containing the most susceptible alleles HLA-DRB1*03 and 
HLA-DRBl *04 within the HLA-DR-DQ region may extend as long as 2 Mb 
[Nejentsev et al. (2007)] into the extended MHC region. It is thus arguable 
that the joint associations observed between the MHC class II region and 
the extended MHC region are due to extensive LD. We used a more tra- 
ditional approach, logistic regression, to test two-way interactions among 
SNPs within MHC conditioning on the HLA-DR-DQ haplotypes. We first 
used CHB [Zhang, Niu and Liu (2006)] to infer haplotypes over HLA-DR- 
DQ genes (32.6-32.8 Mb). After collapsing rare haplotypes with frequencies 
lower than 0.1% into one group and obtaining 91 groups of haplotypes. 
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Fig. 7. LoglO p-values (y-axis) of joint association test between SNPs in the nonclass 
II MHC region (x-axis) and SNPs in the MHC class II region (32.5-32.8 Mb), with main 
effects of the MHC class II SNPs subtracted and conditioned on 78 TID associated haplo- 
types in HLA-DR-DQ. Each SNP has multiple p-values, shown as grey dots, corresponding 
to interactions with different MHC class II SNPs. The black line for the SNP indicates the 
— log 10 p-value of the nonclass II SNP's main effect. A cutoff of 10^"' for the p-values, 
shown as the dashed line, corresponds to roughly one false positive among all tests. 



we regressed the TID status on the 91 haplotype groups using 90 dummy 
variables, which resulted in 78 significant haplotype groups over the HLA- 
DR-DQ genes explaining most of the MHC class II associations with TID 
(the 13 insignificant haplotype groups removed from the model only changed 
the model deviance by 9.7). 

Conditioning on the 78 haplotype groups, we tested both main and inter- 
action effects between pairs of SNPs, one in the nonclass II region (class I, 
class III, and the extended MHC region) and one in the class II region. In 
particular, we regressed the TID status on every pair of SNPs and the 78 
haplotype groups. The association statistic is the change of deviance before 
and after including the two SNPs in the model. We further subtracted the 
main effect of the class II SNPs. As shown in Figure 7, we observed sev- 
eral peaks of p-values demonstrating significant main and interaction effects 
within the nonclass II MHC region. For example, we observed significant 
main effects of gene HIST1H2BD at 26.3 Mb (peak of black lines), inter- 
action effects of gene PRSS16 at 27.3 Mb and ZNF184 at 27.5 Mb (peaks 
of grey dots), strong marginal effects of region 30-31.4 Mb including class I 
genes HLA-A and HLA-B (peaks of black lines), and strong interaction ef- 
fects of gene BATl at 31.6 Mb (peaks of grey dots). The associations of 
PRSS16, ZNF184, and BATl have been previously reported [Nejentsev et 
al. (2007); Viken et al. (2009)], and our analysis further suggested that these 
genes are associated with TID mainly through interactive effects with MHC 
class II genes, controlling the MHC class II haplotypes. 

7. Discussion. In this article we proposed a model-based Bayesian method 
for simultaneous LD-block partitioning and multi-locus epistasis association 
mapping. Different from many block-based methods, we combined block 
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partitioning and association mapping into a unified Bayesian model, where 
both block structures and SNP associations are iteratively learned through 
MCMC sampling. For block partitioning, our simulation study and real data 
analysis showed that BEAM2 produces accurate and consistent partitions 
of SNPs that compared favorably with other genetic knowledge than some 
existing methods. The posterior probabilities of block boundaries output 
by BEAM2 not only report the most likely block partitions but also mea- 
sure the uncertainty of blocks. Compared to some existing methods using 
ad hoc partition rules, BEAM2 has two main advantages: (1) it provides 
soft partitions rather than hard partitions, that is, it provides a posterior 
distribution of block partitions given the data, rather than a single block 
partition result; soft partitions are necessary in regions with no dominant re- 
combination hotspots and block structures; and (2) it scales up well to large 
sample sizes and produces consistent results. In particular, we showed that 
our model-based method produces consistent block partitions as the num- 
ber of individuals increases, whereas the other methods we tested produced 
drastically different results when more individuals from the same population 
are included. For association mapping, BEAM2 is more powerful than both 
the original BEAM algorithm, which accounts for LD using a Markov chain, 
and the methods using pre-estimated blocks. BEAM2 tests disease associa- 
tions over uncertain block structures, where most SNPs around disease loci 
are filtered out, as their associations are created by LD with nearby disease 
SNPs. 

The output of BEAM2 is a list of SNP-wise posterior probabilities of 
marginal and interaction associations. From a frequentist point of view, the 
identified SNPs can be further tested for genome- wide statistical significance. 
We previously introduced a Bayes factor-based statistics, B-stat [Zhang and 
Liu (2007)], for testing the significance of SNP associations. B-stat performs 
similarly to a 2-df chi-square test for single SNPs, but is more powerful for 
testing interactions of multiple SNPs. The same B-stat can be used to test 
the statistical significance of SNPs selected by the BEAM2 model, as we 
demonstrated in this paper. 

When apphed to the WTCCC TID data, BEAM2 found all 5 statisti- 
cally significant loci previously reported in Table 3 of WTCCC (2007), and 
captured 7 moderately (insignificant) associated loci listed in Table 4 of 
WTCCC (2007) with nontrivial posterior probabilities (>0.1). BEAM2 fur- 
ther reported some novel two-way joint associations, including 7 SNP pairs 
(p- value < 5e— 10) in short-distance and 4 SNP pairs (p- value < 5e— 11) in 
long-distance. The local two-way associations indicate main effects of the re- 
lated genes, which are, however, not detectable using single SNP tests alone. 
The long-distance two-way associations did not pass the genome-wide Bon- 
ferroni multiple testing control, but they may be justifiable as real interac- 
tion associations if a better multiple testing method is used and a replication 
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study is performed. We further analyzed the well-known MHC region, and 
our analysis conditioning on the MHC class II haplotypes suggested the ex- 
istence of interaction associations rather than MHC extended linkage alone. 
Given the complex nature and the important biological role of MHC in the 
human genome, our analysis is rather limited. Sophisticated analysis is in 
dire need to further reveal the genetic mechanisms underlying MHC and 
immune diseases. 

The current BEAM model can be further improved in several ways. First, 
missing genotypes and unobserved SNPs in the case-control sample can be 
treated via imputation [Zhang (2011)]. Previous studies have shown that 
imputing untyped SNPs and missing genotypes from a reference panel can 
improve the power of disease association mapping [Marchini et al. (2007)]. 
Second, the current model only reports SNP-wise posterior probabilities of 
associations to the disease without providing a detailed analysis of associa- 
tion structures of the detected SNPs interactions. We have observed in the 
MHC region a bulk of SNPs exhibiting complex association structures with 
TID. A post-analysis of the MHC region is thus needed to delineate fine 
structures of the selected SNPs and interactions with respect to their dis- 
ease effects and inter-relationships. Third, from a statistical point of view, it 
is critical to control the false-discovery rate. This is particularly important 
for multi-locus tests, which often involve a much larger number of simulta- 
neous comparisons than single SNP tests. We are developing new statistical 
methods for evaluating genome-wide statistical significance of associations. 

SUPPLEMENTARY MATERIAL 

Additional Supporting Information and Results 
(DOI: 10. 1214/1 1-AOAS469SUPP; .doc). The file includes a naive SNP- 
block model used in our comparison, verification of population structure 
in the sample, LD analysis of the MHC region, Chi-square results of our 
simulation study, and comparison of our results with previous results in the 
TID WTCCCl data. 
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